# Côtehaus
Côtehaus is a house in the coast.
In the land of hopes, Côtehaus is to become a Java 7 subset syntax directed Dalvik bytecode translator. Also known as a compiler.

# License
It is released dually under The MIT License and The GNU General Public License version 3, with the exception of file `src/main/types.h` which includes copied code governed by a separate license as pointed out from that particular file. For the general case, see LICENSE file.

# Developer notes

## Toolchain installation

If you have a Debian system, a `Makefile` is in Côtehaus to automate the install of tools.
So you run `make debian_install_zlib` and
`make debian_install_polarssl`.

If you are wandering the Cygwin ramblings, you have to install yourself the packages `gcc-core`, `libgcc1`, `mingw64-x86_64-gcc-core` (that will provide the command `mingw64-x86_64-gcc`), `mingw64-i686-gcc-core` (that will provide the command `mingw64-i686-gcc`), `libiconv-devel`, `bison` and `make`. Check the installed packages by running `cygcheck iconv`, `cygcheck gcc`, `gcc -v` and
`x86_64-w64-mingw32-gcc.exe -v`.
If your choice is neither Debian nor Cygwin, theese instructions may turn out a bit lacking, you're unwound at luck and creativity.

Whichever way you go, get done the job of installing `zlib` and `polarssl`. Read the `Makefile` if you need to know more about them.


## Functional definition

Côtehaus may be seen as divided in phases: *input*, *processing* and *output*.

Being at input phase means to take an input file and while seeing it make records of things useful to recognize about that input file. The input file is a Java program, as in this example:

~~~
class A {
	int add(int a, int b) {
		return a+b;
	}
}
~~~

The prototype `Iadd(II)` present in the input file (derived from `int add(int a, int b)`) is to Côtehaus important to record during the input phase.

After the input phase finishes, Côtehaus has a lot of records derived from things seen in the input files, like `Iadd(II)` and many others.

Then the processing phase begins, and the goal of this phase is to set the records in a defined order. Let's take this example: `add(II)` and `Isubstract(II)`. They are 2 prototype records that might be produced but in an unexpected order, let's say `Isubstract(II)` is produced first and `Iadd(II)` is later. Côtehaus knows that the expected order is right the invert of the produced order. This time we let ourselves trust Côtehaus to determine the expected order, because from a point of view, such a job is what a compiler is to make for us. During the processing phase `Iadd(II)` is set first and `Isubstract(II)` later.

All the records are persisted to an output file at the output phase. Sending messages, respecting conventions, using bitpatterns, informing how much size a prototype's array occupies, are among things that describe the output phase and have the output file done.


## Technical concerns

### Relating source code to phases and items

The way how the code is written, there are naming customs used for each one of the mentioned phases. For example `add_...` for the input phase, `build_o...` for the processing phase and `pack_...` for the output phase.

The name of the item belongs on the [Dalvik's format specification][2], `string_id_item` and `proto_id_item` are examples of items defined in that specification. These names might be further shortened in Côtehaus, as in `str_id` and `proto_id`, that should be easy to guess which is which one.

Figure out which **item** is handled on which **phase** by which **piece of source code** is explained next. Take a name stub from the shown above, like `add_...` for the input phase, then another name stub like `proto_id`. Put them together, and if you see a function named `add_proto_id` somewhere arround the source code, you'll know that function handle the input of a `proto_id_item`, of course of the input phase.

**Exercise**

From the table below.

| Phase        | item        |
|--------------|-------------|
| `add_...`    | `str_id`    |
| `build_o...` | `proto_id`  |
| `pack...`    | `class_def` |

Tell what should be the functions that:
- Handle a `string_id_item` during the output phase.
- Handle a `proto_id_item` during the processing phase.

**Answer**

- Function `pack_str_id` handles a `string_id_item` during the output phase.
- Function `build_oproto_id` handles a `proto_id_item` during the processing phase.

### Input phase

To follow the example of `proto_id` used above, a `proto_id_item` may be seen as defined in theese terms: `tok = elements of (ShortyDescriptor_tok, TypeDescriptor_tok, TypeDescriptor_tok, TypeDescriptor_tok)`. Said so, a `proto_id` is defined in terms of all tokens (`tok`). Each token is then recorded as a string with a call to `add_str_id`.
The records placed by `add_str_id` are chained in a list without any expected order. As many times `add_str_id` is called, a countet increments to carry the quantity of records placed into that chained list. Exercise: look at the code of `add_str_id` and try to tell the name of the list and the name of the counter.

At the input phase, the file `java7.y` encodes information about what to do with a Java program input file. It leverages a subset of the Java language at its 7th version. You need a tool called Bison that puts `java7.y` into terms that Côtehaus doesn't understand but computer machines you use do, a fact just fulfilling to Côtehaus. So, `java7.y` encodes how to distinguish the several parts of a Java program. If speaking about a prototype as one of that many parts, `java7.y` associates a prototype with calls to `add_proto_id`, which in turn makes calls to `add_str_id`, putting the records in place, into either chained lists, one for prototypes, the other for strings.

Let's see an `add_proto_id` function depict:
~~~
struct proto_id_st *add_proto_id(/*things relative to a proto_id*/) {
  struct proto_id_st *proto_id = /*mallocs to a proto_id*/;
  
  /*set proto_id relative things, making malloc or using a function parameter directly:
  
  proto_id->a_member = assign a function parameter as is;
  
  proto_id->another_member = malloc based on another function parameter (apply some logic);
  
  etc.
  */
  
  /* Chain the new proto_id object i to a list with all of the rest of  proto_id's.
  "_list" is by custom part of the chain name. This is so to proto_id as to any item type.
  "_head" is by custom part of the name of the chain head for each item type
  */
  list_add(&proto_id->proto_id_list, &proto_id_head);
  
  // count the items in the chain
  proto_id_list_size++;
  
  return proto_id;
}
~~~

### Processing phase

At the processing phase the ordering (sort) of records happen, taking the records somewhat closer to how Dalvik's format expects, that is, the expected order.

There are upstream functions doing the job, like `build_oproto_id`. Functions like that lean on downstream functions that make the bottom half of the job, as is the case of `oproto_id_st_compar`. Arrays gain a place in processing phase, while the records they hold are those that chained lists used to in the input phase. Arrays are to work in compass with `..._compar` functions, together they accomplish the goal of sorting the records.

`ostr_id_ary` is an ordered array of strings. `o` is for **ordered**, `_ary` is for **array**, just name customs. `str_id` is the part of the name shortened from Dalvik's format spec. `_idx` is an atribute where an ascending numerical value is assigned to each record, considering that if more than one `str_id` has as its string target one with the same string value as another one, only one is taken to represent all of them (an *equivalence class representative*) and only such one is assigned an `_idx`. Because this `_idx` is assigned over the sorted array, then this `_idx` conforms to what Dalvik's format expects about array indexes.

Let's go ahead from an example based on a same one used before.

~~~
class A {
	int add(int a, int b) {
		return a+b;
	}
	
	int substract(int a, int b) {
	    return a-b;
	}
}
~~~

The break down of a *prototype* can be reasoned as a *prototype* has a *shorty description*, a *return type*, and an *array of parameter types*. For the case of `int add(int a, int b)`-derived prototype, let's allow the conceeded wisdom that its  shorty description is `I(II)`. For `int substract(int a, int b)` it is the same at all, `I(II)`, because it has the same return type and the same parameter number and types.

The break down of a *prototype* to the deepest level is just the break down of each of its parts. So, a *return type* is indeed a *type*, an *array of parameter types* is an *array* of which each element is a *type*, and a *type* points at a *string_id* which in turn points at a *string data* to hold the string representing that type's descriptor. By the way, this way of reasoning can be refered to as *recursive*. In the chained list of *string data*, we have a boilerplate of `I`'s records produced at the input phase.

After records are ordered all `I` are set contiguous. Then the boilerplate is eliminated by taking just one `I` from the *string data array* as the *representative* for an [*equivalence class*][1] for all the `I`'s. This process happens at the processing phase and transits the whole break down. An *equivalence class representative* is choosen from the *string ids*, then from the *types*, and widening even beyond the extent, `I(II)` is the same for `add` and `substract` so from the *array of shorty descriptions*, just one is choosen as *representative* to be emitted later at output phase. Instead of each record pointing at its own copy, all point at the position (or *index* in Dalvik's speak and in truth in common speak), of the only one copy that represents all of them.

#### Indices or Offsets

In a spatial referential system as the .dex file is, items have an inherent address, a numerical value of the item's distance to a zero point or the beginin of the referential system (the whole message or the output file's beginin). If an array has a fixed length to all of its items, then each item's address can be encoded in terms of the item position or *index*, assigned to in an item's attribute which name contains `_idx` in Côtehaus's source code. If the estipulation (expectation) about an array is to have variable length items, then Côtehaus tracks the address distance each item adds to the next one, called *offset* or `relative_offset` in the source code. Probably in accordance with this reasoning, the [Dalvik format][2] requests the *array of string ids* are index based, whilst the *array of list type arrays* are offset based, just to take 2 random cases.

Items (records) in `otype_array` (the array of list type arrays) have each of them a fixed length and an uniformly variable length (varies its length for it has more or less items but uniformly for each item is fixed sized). `relative_offset` is used as for this case as by name custom for all offset based arrays, and each item is assigned a value for its `relative_offset` at processing phase counting the offset and the overall length (fixed+variable) of the previous item over a sorted, [equivalence-class-represented][1] array, so that equivalent list type array items are neither counted nor emitted at output phase.

### Output phase

The program at the output phase carries with emitting an external representation in compliance to Dalvik's format. Right here, the the fitting of the datas in a spatial referential system overwhelms the data's meaning itself. after that fact, library calls are used that invert the relation format-meaning, turning the format into the meaning. This is just like you swap forefront and background on how you see a picture. To name one, a library call for such a purpose is on the function `htole32`, standing for _**h**ost representation **to** **l**ittle **e**ndian of **32** bits_. Library calls like that are made from the `pack_...` functions.

#### Alignment

Alignmemt is primarily performed at output phase, though distance between arrays to be aligned is based on numbers computed as records are being blown, during previous phases. `up_bounds_ary` has something to do with that along processing and output phases.

Sending messages to the output, whilst an aligment unaware version would direct to emit the bits for the next array start, with alignment on board some bits are emitted without meaning but pushing the next array to a position that satisfies a condition about its distance from the start of the message (the beginin of the output file). That condition is requested by the Dalvik's format spec. Those meaningless bits are called *padding*.

The padding length has this formula:

	padding_length=(next_item_alignment-last_up_bound_address%next_item_alignment)%next_item_alignment

+ *next\_item\_alignment* is a constant valued attribute, that changes only if the item changes (items as a prototype and a type list may have different alignment attributes so the value changes).
+ *last\_up\_bound\_address* is the uppest address of the last item. There is a relation between item arrays over which an array address is pushed up by the arrays that occupy an upper space, and this array pushes up the addresses of down-placed arrays. Côtehaus tracks last\_up\_bound\_address keeping the relation between item arrays.
+ `last_up_bound_address%next_item_alignment` tells the upper items' highest address excess respecting as how much it was to fall exactly aligned to next item alignment.
+ `(next_item_alignment-last_up_bound_address%next_item_alignment)%next_item_alignment`, the substraction takes the excess' complement, it gives how much length to make the next padding, except if there was no excess, in which case last reminder operation (that outside the parenthesis) takes on fixing up that.

#### Eliminating redundancies

When a set of [equivalence-class][1] items is equivalence-class-represented by any one of them, not only Côtehaus emitts just that one, but also all the references, either by index or by offset, are fixed at that one, because the choosen representant can be fully taken as the new copy to refer to, from everywhere. After that kind of factorization (boilerplate elimination) the break down of *prototype id* is: one copy of the many `I(II)` (`int add(int a, int b)`, `int substract(int a, int b)` et. al.) exists and points at (refer to) a *return type* (indeed a *type*) by index. Also it points at an *array of types* by offset, each of which points at a *type*, the same copy that that one *return type*. The *return type* points at a copy of *string id* by index that in turn points at a copy of *string data* for the type descriptor `I`, and no other copies exists in the output file as for *string data* as *string id* representing the `I` type.

# Still early stage / blurly looked

Vaguely speaking, pairs of source/bytecode that Côtehaus should be able to transform, from source to bytecode, are like the listed next. The bytecode listed next might be .apk wrapped, so the reader is meant to take into account only the .dex wrapped inside. Especially, ignoring the .apk PGP signature which is not in scope (don't get it wrong with the .dex header's MD5 hash, this one yes, ***is*** in scope).
- Planning Poker \[[Feature chart](https://f-droid.org/packages/saschpe.poker/)\] \[[source](https://f-droid.org/repo/saschpe.poker_170127_src.tar.gz)\] \[[binary](https://f-droid.org/repo/saschpe.poker_170127.apk)\]
- RadarWeather \[[Feature chart](https://f-droid.org/packages/org.woheller69.weather/)\] \[[source](https://f-droid.org/repo/org.woheller69.weather_35_src.tar.gz)\] \[[binary](https://f-droid.org/repo/org.woheller69.weather_35.apk)\]




[1]: <https://en.m.wikipedia.org/wiki/Equivalence_class> "Equivalence class"

[2]: <https://source.android.com/devices/tech/dalvik/dex-format> "Dalvik format specification"